Matching Elements Reference
Overview
Matching elements perform the pattern recognition and content analysis that identify sensitive data. Combined with Policy Elements, they form the detection rules that recognize specific types of sensitive information.
Element Categories
Keyword Matching Elements
Keyword elements match specific terms or phrases within content.
Keyword Element Structure
<Keyword id="keyword-identifier">
<Group matchStyle="word">
<Term>keyword1</Term>
<Term>keyword2</Term>
<Term>keyword3</Term>
</Group>
</Keyword>
Keyword Attributes
| Attribute | Type | Description | Required | Values |
|---|---|---|---|---|
| id | String | Unique identifier for the keyword list | Yes | Must be unique within rule pack |
Group Attributes
| Attribute | Type | Description | Required | Default | Values |
|---|---|---|---|---|---|
| matchStyle | String | How keywords should be matched | No | word | word, string, regex |
Match Styles
| Style | Description | Example | Use Case |
|---|---|---|---|
| word | Match whole words only | "account" matches "account number" but not "accountability" | Precise term matching |
| string | Match substring anywhere | "account" matches "accountability" | Broader pattern matching |
| regex | Treat terms as regular expressions | \baccount\b for word boundaries | Complex pattern matching |
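The three match styles can be approximated outside the engine to check how a term will behave before it goes into a rule pack. The sketch below uses Python's `re` module. The `matches` helper is hypothetical and only mirrors the table above; the engine's own tokenization may differ.

```python
import re

def matches(term: str, text: str, match_style: str = "word") -> bool:
    """Approximate the word, string, and regex match styles (illustrative only)."""
    if match_style == "word":
        # Whole-word match: the term must sit between word boundaries.
        pattern = r"\b" + re.escape(term) + r"\b"
    elif match_style == "string":
        # Substring match: the term may appear anywhere, even inside a word.
        pattern = re.escape(term)
    elif match_style == "regex":
        # The term is already a regular expression.
        pattern = term
    else:
        raise ValueError(f"unknown match style: {match_style}")
    return re.search(pattern, text) is not None

print(matches("account", "your account number", "word"))          # True
print(matches("account", "accountability report", "word"))        # False
print(matches("account", "accountability report", "string"))      # True
print(matches(r"\baccount\b", "accountability report", "regex"))  # False
```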
Keyword Examples
Financial Terms:
<Keyword id="financial-keywords">
<Group matchStyle="word">
<Term>account</Term>
<Term>balance</Term>
<Term>payment</Term>
<Term>transaction</Term>
<Term>routing</Term>
<Term>deposit</Term>
</Group>
</Keyword>
Personal Information:
<Keyword id="personal-keywords">
<Group matchStyle="word">
<Term>social security</Term>
<Term>SSN</Term>
<Term>date of birth</Term>
<Term>DOB</Term>
<Term>driver license</Term>
</Group>
</Keyword>
Regular Expression Elements
Regular expression elements match structured data that follows a predictable format, such as identification numbers.
Regex Element Structure
<Regex id="regex-identifier">
<Pattern>regular-expression-pattern</Pattern>
</Regex>
Common Regex Patterns
Social Security Numbers:
<Regex id="ssn-pattern">
<Pattern>\b\d{3}-?\d{2}-?\d{4}\b</Pattern>
</Regex>
Credit Card Numbers:
<Regex id="credit-card-pattern">
<Pattern>\b(?:\d{4}[-\s]?){3}\d{4}\b</Pattern>
</Regex>
Phone Numbers:
<Regex id="phone-pattern">
<Pattern>\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b</Pattern>
</Regex>
Email Addresses:
<Regex id="email-pattern">
<Pattern>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b</Pattern>
</Regex>
IP Addresses:
<Regex id="ip-address-pattern">
<Pattern>\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b</Pattern>
</Regex>
Built-in Function Elements
Built-in functions combine pattern matching with additional verification logic, such as a checksum or format check.
Available Functions
| Function Name | Description | Validation Method |
|---|---|---|
| Func_credit_card_formatted | Credit card numbers with formatting | Luhn algorithm + format check |
| Func_credit_card_unformatted | Credit card numbers without formatting | Luhn algorithm |
| Func_ssn_formatted | SSN with dashes (XXX-XX-XXXX) | Format validation |
| Func_ssn_unformatted | SSN without formatting (XXXXXXXXX) | Format validation |
| Func_us_phone_formatted | US phone with formatting | Format validation |
| Func_us_phone_unformatted | US phone without formatting | Format validation |
| Func_email_address | Email addresses | RFC 5322 validation |
| Func_ip_address | IP addresses | IPv4/IPv6 validation |
Function Usage
<!-- Credit card with Luhn validation -->
<IdMatch idRef="Func_credit_card_formatted"/>
<!-- SSN with format validation -->
<IdMatch idRef="Func_ssn_formatted"/>
<!-- Email with RFC validation -->
<IdMatch idRef="Func_email_address"/>
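The credit card functions are listed as using the Luhn algorithm. For readers who want to see what that check involves, here is a minimal sketch of the standard Luhn checksum in Python. The `luhn_valid` name is hypothetical, and this is not the engine's own implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaces and dashes in formatted numbers are ignored.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A checksum like this is what lets a validated function reject many digit sequences that a plain 16-digit regex would accept.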
Localized String Elements
Localized strings provide language-specific keyword matching.
LocalizedStrings Structure
<LocalizedStrings>
<Resource idRef="financial-terms">
<Name default="true" langcode="en">Financial Terms</Name>
<Name langcode="es">Términos Financieros</Name>
<Description default="true" langcode="en">Common financial terminology</Description>
<Description langcode="es">Terminología financiera común</Description>
</Resource>
</LocalizedStrings>
Localized Keywords
<Keyword id="payment-terms-localized">
<Group matchStyle="word" langcode="en">
<Term>payment</Term>
<Term>invoice</Term>
<Term>bill</Term>
</Group>
<Group matchStyle="word" langcode="es">
<Term>pago</Term>
<Term>factura</Term>
<Term>cuenta</Term>
</Group>
<Group matchStyle="word" langcode="fr">
<Term>paiement</Term>
<Term>facture</Term>
<Term>compte</Term>
</Group>
</Keyword>
Advanced Matching Techniques
Proximity-Based Matching
Elements can be required to occur within a specified distance of one another. The patternsProximity attribute on the Entity sets that distance:
<Entity id="credit-card-with-context" patternsProximity="300">
<Pattern confidenceLevel="90">
<IdMatch idRef="Func_credit_card_formatted"/>
<Match idRef="credit-card-keywords"/>
</Pattern>
</Entity>
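The idea behind proximity matching can be sketched in Python as a window around each primary match. This sketch assumes the distance is counted in characters on either side of the match, which the text above does not state. The keyword tuple and the `has_keyword_near` helper are hypothetical.

```python
import re

CARD_RE = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")
KEYWORDS = ("credit card", "card number", "visa")  # hypothetical keyword list

def has_keyword_near(text: str, proximity: int = 300) -> bool:
    """True if a card-like number has a keyword within `proximity` characters."""
    for match in CARD_RE.finditer(text):
        # Build a window extending `proximity` characters on each side (assumption).
        window_start = max(0, match.start() - proximity)
        window_end = min(len(text), match.end() + proximity)
        window = text[window_start:window_end].lower()
        if any(keyword in window for keyword in KEYWORDS):
            return True
    return False

near = "Visa card number: 4111 1111 1111 1111"
far = "visa" + " " * 400 + "4111 1111 1111 1111"
print(has_keyword_near(near))                # True
print(has_keyword_near(far))                 # False
print(has_keyword_near(far, proximity=500))  # True
```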
Conditional Matching
Use logical operators to create complex matching conditions.
Any Element
Match when at least minMatches of the listed patterns match (two of the three below):
<Any minMatches="2">
<Match idRef="financial-keywords"/>
<Match idRef="payment-keywords"/>
<Match idRef="banking-keywords"/>
</Any>
All Element
Require all patterns to match:
<All>
<Match idRef="ssn-pattern"/>
<Match idRef="personal-keywords"/>
<Match idRef="government-keywords"/>
</All>
Not Element
Exclude certain patterns:
<Pattern confidenceLevel="80">
<IdMatch idRef="ssn-pattern"/>
<Match idRef="personal-context"/>
<Not>
<Match idRef="test-data-keywords"/>
</Not>
</Pattern>
Case Sensitivity
Control case sensitivity with the caseSensitive attribute on a Group:
<Keyword id="case-sensitive-terms">
<Group matchStyle="string" caseSensitive="true">
<Term>API</Term>
<Term>SQL</Term>
<Term>XML</Term>
</Group>
</Keyword>
<Keyword id="case-insensitive-terms">
<Group matchStyle="word" caseSensitive="false">
<Term>confidential</Term>
<Term>secret</Term>
<Term>private</Term>
</Group>
</Keyword>
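The effect of the caseSensitive setting can be previewed with a short Python sketch. The `find_terms` helper is hypothetical and uses plain substring search, which corresponds to the string match style only.

```python
import re

def find_terms(terms, text, case_sensitive=False):
    """Return the terms found in `text`, honoring the case-sensitivity flag."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return [term for term in terms if re.search(re.escape(term), text, flags)]

text = "the api layer issues sql queries"
print(find_terms(["API", "SQL", "XML"], text, case_sensitive=True))   # []
print(find_terms(["API", "SQL", "XML"], text, case_sensitive=False))  # ['API', 'SQL']
```

Case-sensitive matching suits acronyms such as API or SQL, where the lowercase form is often an ordinary word fragment.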
Pattern Validation
Format Validation
Built-in functions provide format validation for common data types:
<!-- Validates credit card format and Luhn checksum -->
<IdMatch idRef="Func_credit_card_formatted"/>
<!-- Validates SSN format (XXX-XX-XXXX) -->
<IdMatch idRef="Func_ssn_formatted"/>
<!-- Validates email format per RFC 5322 -->
<IdMatch idRef="Func_email_address"/>
Custom Validation
Create custom validation using regex with specific constraints:
<!-- US ZIP code validation -->
<Regex id="us-zip-code">
<Pattern>\b\d{5}(?:-\d{4})?\b</Pattern>
</Regex>
<!-- Canadian postal code validation -->
<Regex id="canadian-postal-code">
<Pattern>\b[A-Za-z]\d[A-Za-z][-\s]?\d[A-Za-z]\d\b</Pattern>
</Regex>
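Custom patterns like these are worth exercising against sample values before deployment. The following Python sketch runs the two patterns above against a few inputs. The sample values are illustrative, and Python's `re` syntax is assumed to be close enough to the engine's for these simple patterns.

```python
import re

US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
CA_POSTAL = re.compile(r"\b[A-Za-z]\d[A-Za-z][-\s]?\d[A-Za-z]\d\b")

def is_us_zip(value: str) -> bool:
    """True if the whole value is a 5-digit or ZIP+4 code."""
    return US_ZIP.fullmatch(value) is not None

def is_ca_postal(value: str) -> bool:
    """True if the whole value has the letter-digit shape of a Canadian postal code."""
    return CA_POSTAL.fullmatch(value) is not None

print(is_us_zip("98052"))       # True
print(is_us_zip("98052-6399"))  # True
print(is_us_zip("1234"))        # False
print(is_ca_postal("K1A 0B1"))  # True
print(is_ca_postal("k1a0b1"))   # True
print(is_ca_postal("K1A-0B"))   # False
```

Both patterns check shape only. A value such as 00000 passes the ZIP pattern, so pairing these patterns with supporting keywords helps reduce false positives.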
Performance Optimization
Keyword Optimization
- Use Specific Terms: Prefer specific over generic keywords
- Limit List Size: Keep keyword lists under 1000 terms
- Group Related Terms: Organize keywords logically
- Use Word Matching: Prefer word matching over string matching
Regex Optimization
- Anchor Patterns: Use `\b` for word boundaries
- Avoid Backtracking: Use non-capturing groups `(?:)`
- Limit Quantifiers: Avoid excessive `*` and `+` operators
- Test Performance: Validate regex performance with large content
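Validating regex performance with large content can be as simple as timing a scan over synthetic text. The sketch below is one way to do that in Python. The `time_pattern` helper and the generated content are hypothetical, and timings from Python's `re` engine only approximate what the detection engine will see.

```python
import re
import time
from typing import Tuple

def time_pattern(pattern: str, content: str, repeats: int = 3) -> Tuple[int, float]:
    """Return (match count, best wall-clock seconds) for scanning `content`."""
    compiled = re.compile(pattern)
    best = float("inf")
    count = 0
    for _attempt in range(repeats):
        start = time.perf_counter()
        count = sum(1 for _match in compiled.finditer(content))
        best = min(best, time.perf_counter() - start)
    return count, best

# Synthetic content: one SSN-shaped value per repeated sentence.
content = "Order 1234 shipped. SSN 123-45-6789 on file. " * 5_000
count, seconds = time_pattern(r"\b\d{3}-?\d{2}-?\d{4}\b", content)
print(f"{count} matches in {seconds:.4f}s")
```

Running the same harness on a candidate pattern and on a simpler alternative gives a quick relative comparison before either goes into a rule.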
Pattern Ordering
Order patterns by selectivity (most specific first):
<Entity id="optimized-detection">
<!-- Most specific pattern first -->
<Pattern confidenceLevel="95">
<IdMatch idRef="Func_credit_card_formatted"/>
<Match idRef="specific-keywords"/>
</Pattern>
<!-- Less specific patterns follow -->
<Pattern confidenceLevel="75">
<IdMatch idRef="credit-card-regex"/>
<Any minMatches="2">
<Match idRef="general-keywords"/>
<Match idRef="context-keywords"/>
</Any>
</Pattern>
</Entity>
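The ordering above can be pictured as a first-match evaluation over patterns sorted by specificity. The Python sketch below assumes the first pattern that matches decides the reported confidence, which the text does not confirm for the engine. The checks are simplified stand-ins for the XML patterns.

```python
import re
from typing import Optional

CARD_RE = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")

def specific(text: str) -> bool:
    """Card-like number plus an explicit keyword (stand-in for the 95 pattern)."""
    return CARD_RE.search(text) is not None and "credit card" in text.lower()

def general(text: str) -> bool:
    """Card-like number alone (simplified stand-in for the 75 pattern)."""
    return CARD_RE.search(text) is not None

# Most specific pattern first, so it wins when both would match.
PATTERNS = [(95, specific), (75, general)]

def confidence(text: str) -> Optional[int]:
    """Return the confidence level of the first matching pattern, if any."""
    for level, check in PATTERNS:
        if check(text):
            return level
    return None

print(confidence("Credit card: 4111 1111 1111 1111"))  # 95
print(confidence("Ref 4111 1111 1111 1111"))           # 75
print(confidence("No numbers here"))                   # None
```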
Common Patterns Library
Financial Data Patterns
<!-- Credit Card Numbers -->
<Regex id="visa-pattern">
<Pattern>\b4\d{3}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b</Pattern>
</Regex>
<Regex id="mastercard-pattern">
<Pattern>\b5[1-5]\d{2}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b</Pattern>
</Regex>
<Regex id="amex-pattern">
<Pattern>\b3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}\b</Pattern>
</Regex>
<!-- Bank Account Numbers -->
<Regex id="bank-account-pattern">
<Pattern>\b\d{8,17}\b</Pattern>
</Regex>
<!-- Routing Numbers -->
<Regex id="routing-number-pattern">
<Pattern>\b[0-9]{9}\b</Pattern>
</Regex>
Personal Information Patterns
<!-- Driver License Numbers -->
<Regex id="drivers-license-pattern">
<Pattern>\b[A-Z]{1,2}\d{6,8}\b</Pattern>
</Regex>
<!-- Passport Numbers -->
<Regex id="passport-pattern">
<Pattern>\b[A-Z]{2}\d{7}\b</Pattern>
</Regex>
<!-- Medical Record Numbers -->
<Regex id="medical-record-pattern">
<Pattern>\bMRN[-\s]?\d{6,10}\b</Pattern>
</Regex>
Technical Patterns
<!-- API Keys -->
<Regex id="api-key-pattern">
<Pattern>\b[A-Za-z0-9]{32,64}\b</Pattern>
</Regex>
<!-- Database Connection Strings -->
<Regex id="connection-string-pattern">
<Pattern>(?i)(?:server|data source|host)=.+?(?:;|$)</Pattern>
</Regex>
<!-- File Paths -->
<Regex id="file-path-pattern">
<Pattern>(?:[A-Za-z]:\\|/)(?:[^\\/:*?"&lt;&gt;|\r\n]+[\\/])*[^\\/:*?"&lt;&gt;|\r\n]*</Pattern>
</Regex>
Testing and Validation
Pattern Testing
- Positive Tests: Verify patterns match intended content
- Negative Tests: Ensure patterns don't match unintended content
- Edge Cases: Test boundary conditions and special characters
- Performance Tests: Measure execution time with large content
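The positive, negative, and edge-case tests listed above fit naturally into a small table-driven harness. The sketch below checks the SSN pattern from this reference in Python. The sample strings and the `run_cases` helper are hypothetical.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# Strings the pattern should match, including an edge case in parentheses.
POSITIVE = ["123-45-6789", "SSN: 123456789", "(123-45-6789)"]
# Strings the pattern should not match: short, misgrouped, or glued to a letter.
NEGATIVE = ["123-45-678", "1234-56-7890", "phone 555-1234", "A123-45-6789"]

def run_cases(pattern, positive, negative):
    """Return a list of failure messages; an empty list means every case passed."""
    failures = []
    for text in positive:
        if not pattern.search(text):
            failures.append(f"expected a match: {text!r}")
    for text in negative:
        if pattern.search(text):
            failures.append(f"unexpected match: {text!r}")
    return failures

failures = run_cases(SSN_RE, POSITIVE, NEGATIVE)
print("all cases passed" if not failures else "\n".join(failures))
```

Keeping the case lists next to the pattern makes it easy to add a regression case whenever a false positive or a missed match is reported.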
Validation Checklist
- Regex patterns are syntactically correct
- Keyword lists contain relevant terms
- Match styles are appropriate for use case
- Case sensitivity settings are correct
- Proximity values are reasonable
- Performance is acceptable with test content
Best Practices
Design Guidelines
- Start Simple: Begin with basic patterns, add complexity gradually
- Use Built-ins: Prefer built-in functions over custom regex when available
- Test Thoroughly: Validate with representative content samples
- Document Patterns: Include comments explaining complex regex patterns
Performance Guidelines
- Optimize Regex: Use efficient regex patterns
- Limit Scope: Use appropriate proximity values
- Order Elements: Place most selective patterns first
- Monitor Performance: Track execution times and resource usage
Maintenance Guidelines
- Version Control: Track pattern changes over time
- Regular Review: Periodically assess pattern effectiveness
- Update Keywords: Keep keyword lists current with evolving terminology
- Performance Monitoring: Watch for degradation over time